HDDS-12791. Make Retention Service suspendable#9426
HDDS-12791. Make Retention Service suspendable#9426ChenSammi merged 5 commits intoapache:HDDS-8342from
Conversation
There was a problem hiding this comment.
Pull request overview
This PR makes the KeyLifecycleService suspendable at runtime without requiring an OM restart, adding the ability to dynamically enable/disable the service and query its status.
- Makes
ozone.lifecycle.service.enabledreconfigurable so the service can be stopped/started via reconfiguration - Adds a new
ozone admin om lifecycle statusCLI command to query the service state - Implements additional
shouldRun()checks in the service to allow graceful suspension during bucket processing
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto | Adds GetLifecycleServiceStatus request/response proto messages and type enum |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java | Marks GetLifecycleServiceStatus as a read-only operation |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java | Adds getLifecycleServiceStatus() method to the OM protocol interface |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java | Implements client-side translation for the new getLifecycleServiceStatus() RPC |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerRequestHandler.java | Adds request handling for GetLifecycleServiceStatus command type |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java | Implements getLifecycleServiceStatus() and adds reconfiguration handler for lifecycle service enabled flag |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/KeyLifecycleService.java | Changes isServiceEnabled to AtomicBoolean, adds setServiceEnabled() and status() methods, and adds runtime suspension checks |
| hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/OMAdmin.java | Registers the new LifecycleSubCommand |
| hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/LifecycleSubCommand.java | New subcommand for lifecycle-related admin operations |
| hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/LifecycleStatusSubCommand.java | Implements the status query CLI command |
| hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/FinalizationStatusSubCommand.java | Refactored to use try-with-resources for proper client cleanup |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/service/TestKeyLifecycleService.java | Adds test case for lifecycle service status query functionality |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/shell/TestOzoneShellHA.java | Adds integration test for the lifecycle status CLI command |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/reconfig/TestOmReconfiguration.java | Adds lifecycle service enabled configuration to reconfigurable properties test |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| private String reconfKeyLifecycleServiceEnabled(String newVal) { | ||
| boolean enabled = StringUtils.isEmpty(newVal) ? |
There was a problem hiding this comment.
Can we throw exception if newVal is null? I think "null" is an invalid input, so it's better response to user with correct info. Will the message in exception be passed to client and show as the output of CLI?
| LOG.warn("Failed reconfiguration for '{}'. KeyLifecycleService is not initialized.", | ||
| OZONE_KEY_LIFECYCLE_SERVICE_ENABLED); | ||
| } | ||
| getConfiguration().setBoolean(OZONE_KEY_LIFECYCLE_SERVICE_ENABLED, enabled); |
There was a problem hiding this comment.
Let's change the configuration first, and then update the service state.
| UpgradeFinalization.StatusAndMessages progress = | ||
| client.queryUpgradeFinalizationProgress(upgradeClientID, false, true); | ||
| System.out.println(progress.status()); | ||
| try (OzoneManagerProtocol client = parent.createOmClient(omServiceId, omHost, false)) { |
There was a problem hiding this comment.
Is this a relevant change ?
| * Handler of ozone admin status command. | ||
| */ | ||
| @Command( | ||
| name = "status", |
There was a problem hiding this comment.
We have new CLI to query the service status, and leverage the reconfiguration framework for the service start/stop, can we use consolidate to one entry, for example, new CLI for both start/stop/status? Current both status CLI and reconfiguration with both ozone-site.xml content change and reconfiguration CLI, feels a little complicated for user.
There was a problem hiding this comment.
Added suspend subcommand.
| keyTable.iterator(omMetadataManager.getBucketKey(volumeName, bucketName))) { | ||
| while (keyTblItr.hasNext()) { | ||
| if (!shouldRun()) { | ||
| return; |
There was a problem hiding this comment.
Can we add a log message here too about the abort of bucket key evaluation?
| return result; | ||
| } | ||
|
|
||
| if (!shouldRun()) { |
There was a problem hiding this comment.
When it comes here, the key evaluation is finished, the rest of work is sending the expired remaining key to delete/rename. I think we can let this run to finish, instead of return here.
The most time consuming part of DB key evaluation iteration, send keys to delete/rename doesn't take much time.
| Set<String> runningBuckets = new HashSet<>(inFlight.keySet()); | ||
| return GetLifecycleServiceStatusResponse.newBuilder() | ||
| .setIsRunning(!runningBuckets.isEmpty()) | ||
| .setIsEnabled(isServiceEnabled.get()) |
There was a problem hiding this comment.
Can we combine the IsRunning and IsEnabled to one state. If IsRuning is the result of whether there is bucket task running or not, this info is already covered by RunningBuckets fields.
There was a problem hiding this comment.
Removed the setIsRunning.
| * Set isServiceEnabled. | ||
| * @param enabled whether enable the lifecycle Service | ||
| */ | ||
| public void setServiceEnabled(boolean enabled) { |
There was a problem hiding this comment.
Can we reuse the existing suspended variable, and suspend(), resume() function, instead of this new setServiceEnabled() function, as the current suspended variable and it involved function is actually not used?
There was a problem hiding this comment.
I reused the suspended variable and the the isServiceEnabled only based on the Configuration, cannot modify by user
| repeated string runningBuckets = 2; | ||
| } | ||
|
|
||
| message SuspendLifecycleServiceRequest { |
There was a problem hiding this comment.
Can we add an option in the SuspendLifecycleServiceRequest so that we can resume the service once suspend is no longer needed?
There was a problem hiding this comment.
Add a resume command. new we can suspend and resume the lifecycle service
| } | ||
|
|
||
| message GetLifecycleServiceStatusResponse { | ||
| required bool isEnabled = 1; |
There was a problem hiding this comment.
Can we add one more state, to indicate that service is suspened/paused or not?
There was a problem hiding this comment.
Added the suspened flag
|
Thanks @xichen01 for updating the patch. The last commit looks overall good to me, +1. |
What changes were proposed in this pull request?
This PR makes the KeyLifecycleService suspendable at runtime without requiring an OM restart.
ozone.lifecycle.service.enabledreconfigurable, the KeyLifecycleService can be stopped by reconfiguring this configuration.Example:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-12791
How was this patch tested?
unit test.